{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advanced Modules Exercise Solutions\n",
    "\n",
    "It's time to test your new skills, this project will combine multiple skills sets, including unzipping files with Python, using os module to automatically search through lots of files.\n",
    "\n",
    "## Your Goal\n",
    "\n",
    "There is a .zip file called"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unzipping the File\n",
    "\n",
    "We can easily use the shutil library to extract and unzip the contents of the .zip file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import shutil"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "shutil.unpack_archive('Mission_Orange_Files/unzip_me_for_instructions.zip','Mission_Orange_Files/extracted_content','zip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regular Expression to Find the Link\n",
    "\n",
    "There are many approaches to take here, but since we know we are looking for a website, there should be a www. somewhere in the string. So let's search for strings where the www. is somewhere there. Other approaches could have included searching for .coms, .orgs, or other endings, or trying to search for http or https. [More info on other possible regex patterns for domains](https://stackoverflow.com/questions/2490310/regular-expression-for-checking-website-url)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Any word that starts with https:// and for the rest of the word one or more, \\w dots, slashes, ? , =, _ or -\n",
    "pattern = r'https://[-?_=./\\w]+'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "test = \"here is a website https://drive.google.com/open?id=1tWJBFrSQL06qTZgkohs4t_a5Cu84AheLo and https://docs.google.com/document/d/Q-DcxM17nJm_El0aGsNnY7FajaogRviwja/edit let's see if we can find it.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https://drive.google.com/open?id=1tWJBFrSQL06qTZgkohs4t_a5Cu84AheLo',\n",
       " 'https://docs.google.com/document/d/Q-DcxM17nJm_El0aGsNnY7FajaogRviwja/edit']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(pattern,test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def search(file,pattern= r'https://[-?_=./\\w]+'):\n",
    "    f = open(file,'r')\n",
    "    text = f.read()\n",
    "    \n",
    "    if re.search(pattern,text):\n",
    "        return re.search(pattern,text)\n",
    "    else:\n",
    "        return ''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## OS Walk through the Files to Get the Link\n",
    "\n",
    "Now that we have a basic function to search through the text of the files, let's perform an os.walk through the unzipped directory to find the links hidden somewhere in one of the text files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "results = []\n",
    "for folder , sub_folders , files in os.walk(os.getcwd()+\"\\\\Mission_Orange_Files\\\\extracted_content\"):\n",
    "    \n",
    "    for f in files:\n",
    "        full_path = folder+'\\\\'+f\n",
    "         \n",
    "        results.append(search(full_path)) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://drive.google.com/open?id=1tWJBFrSQL06qTZgkohs4t_a5Cu84AheLo\n",
      "https://docs.google.com/document/d/1tWJBFrSQL06qTZgkohs4t_a5Cu84AheLocWTXQ-DcxM/edit?usp=sharing\n",
      "https://drive.google.com/file/d/17nJm_El0aGsNvaMZtnY7FajaogRviwja/view?usp=sharing\n"
     ]
    }
   ],
   "source": [
    "for r in results:\n",
    "    if r != '':\n",
    "        print(r.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "Excellent work! More information on this phone number:\n",
    "* https://www.npr.org/2011/12/21/144069758/callin-oates-the-hotline-you-dont-need-but-might-call-anyway\n",
    "* https://twitter.com/CallinOates"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
